Abstract. Irregular computations on unstructured data are an important
class of problems for parallel programming. Graph coloring is often
an important preprocessing step, e.g. as a way to perform dependency
analysis for safe parallel execution. The total run time of a coloring
algorithm adds to the overall parallel overhead of the application whereas the
number of colors used determines the amount of exposed parallelism. A
fast and scalable coloring algorithm using as few colors as possible is
vital for the overall parallel performance and scalability of many irregular
applications that depend upon runtime dependency analysis.

Çatalyürek et al. have proposed a graph coloring algorithm which relies
on speculative, local assignment of colors. In this paper we present an
improved version which runs even more optimistically, with less thread
synchronization and a reduced number of conflicts compared to Çatalyürek
et al.'s algorithm. We show that the new technique scales better on
multi-core and many-core systems and performs up to 1.5x faster than its
predecessor on graphs with high-degree vertices, while keeping the number
of colors at the same near-optimal levels.
1 Introduction

Many modern applications are built around algorithms which operate on
irregular data structures, usually in the form of graphs. Graph coloring is an
important preprocessing step, mainly as a means of guaranteeing safe parallel
execution in a shared-memory environment but also in order to enforce
neighborhood heuristics, e.g. avoid having adjacent graph edges collapse in
sequence in graph coarsening [6]. Examples of such applications include
iterative methods for sparse linear systems [14], sparse tiling [19,20],
eigenvalue computation [16], preconditioners [18,12] and mesh adaptivity [7,10].


Taking advantage of modern multi-core and many-core hardware requires
not only algorithmic modifications to deal with data races but also
consideration of scalability issues. The exposed parallelism of an irregular
algorithm is directly dependent on the number of colors used. The lower this
number, the more work-items are available for concurrent processing per
color/independent set. Additionally, there is usually some thread
synchronization or reduction before proceeding to the next independent set. A
poor-quality coloring will only exacerbate the effects of thread
synchronization on the parallel scalability of an application. It follows that
a good coloring algorithm should be fast and scalable itself, so as to minimize
its own contribution to the total execution time of the application, and use as
few colors as possible.
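To make the link between colors and exposed parallelism concrete, the following minimal sketch groups vertices into independent sets by color. The helper names `color_classes` and `is_independent`, and the 4-cycle example, are illustrative assumptions and do not come from the paper.

```python
from collections import defaultdict


def color_classes(color):
    """Group vertices by color; each group is one batch of concurrent work."""
    classes = defaultdict(list)
    for v, c in color.items():
        classes[c].append(v)
    return [classes[c] for c in sorted(classes)]


def is_independent(vertices, adj):
    """True if no two vertices in the batch are adjacent."""
    chosen = set(vertices)
    return all(u not in chosen for v in vertices for u in adj[v])


# Hypothetical example: a 4-cycle colored with two colors.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
color = {0: 1, 1: 2, 2: 1, 3: 2}
batches = color_classes(color)
```

With two colors the four vertices fall into two batches of two. A coloring that used four colors would leave only one vertex per batch and one synchronization point after each.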

The simplest graph coloring algorithm is the greedy one, commonly known
as First-Fit (§2.1). There exist parallel versions for distributed-memory
environments, but in this paper we focus on the intra-node, shared-memory case.
Probably the best-known parallel algorithm is the one by Jones and Plassmann
[13], which in turn is an improved version of the original Maximal Independent
Set algorithm by Luby [15]. There also exists a modified version of
Jones-Plassmann which uses multiple hashes to minimize thread synchronization
[3]. A parallel greedy coloring algorithm based on speculative execution was
introduced by Gebremedhin and Manne [9]. Çatalyürek et al. presented an
improved version of the original speculative algorithm in [1] (§2.2). We took
the latter one step further, devising a method which runs under an even more
speculative scheme with less thread synchronization (§3), without compromising
coloring quality.

It must be pointed out that First-Fit variants which use ordering heuristics
were not considered here. Despite recent innovations by Hasenplaugh et al. [11],
those variants take considerably longer to run than the plain greedy algorithm
and in many cases do not achieve a sufficiently large improvement in the number
of colors to justify their cost. Runtime of coloring for the purpose of dynamic
dependency analysis becomes a serious consideration in problems like morph
algorithms [17], which mutate graph topology in non-trivial ways and constantly
invalidate existing colorings. In those cases, the graph has to be recolored in every
iteration of the morph kernel, so coloring becomes a recurring cost rather than a
one-off preprocessing step. As shown in [11], heuristic-based algorithms, although
achieving some reduction in the number of colors, take 4x-11x longer to run and
this would dominate the kernel's runtime. A notable example is the edge-swap
kernel from our mesh adaptivity framework PRAgMaTIc³ [10], in which coloring
(using our fast method) already takes up 10% of the total execution time.

The rest of this paper is organized as follows: In Section 2 we present the
serial greedy coloring algorithm and its parallel implementation by Çatalyürek
et al. We explain how the latter can be improved further, leading to our
implementation, which is described in Section 3 and evaluated against its
predecessor in Section 4. Finally, we briefly explain why the class of
optimistic coloring algorithms is unsuitable for SIMT-style parallel processing
systems in Section 5 and conclude the paper in Section 6.

3 https://github.com/meshadaptation/pragmatic 



2 Background 


In this section we describe the greedy coloring algorithm and its parallel version
proposed by Çatalyürek et al.

2.1 First-Fit Coloring 

Coloring a graph with the minimal number of colors has been shown to be an 
NP-hard problem [8]. However, there exist heuristic algorithms which color a 
graph in polynomial time using relatively few colors, albeit not guaranteeing an 
optimal coloring. One of the most common polynomial coloring algorithms is 
First-Fit, also known as greedy coloring. In its sequential form, First-Fit visits 
every vertex and assigns the smallest color available, i.e. not already assigned to 
one of the vertex’s neighbors. The procedure is summarized in Algorithm 1. 


Algorithm 1 Sequential greedy coloring algorithm.
Input: G(V, E)
for all vertices V_i ∈ G do
    C ← {colors of all colored vertices V_j ∈ adj(V_i)}
    c(V_i) ← {smallest color ∉ C}


It is easy to give an upper bound on the number of colors used by the greedy
algorithm. Let us assume that the highest-degree vertex V_h in a graph has degree
d, i.e. this vertex has d neighbors. In the worst case, each neighbor has been
assigned a unique color; then one of the colors {1, 2, ..., d+1} will be available
to V_h (i.e. not already assigned to a neighbor). Therefore, the greedy algorithm
can color a graph with at most d + 1 colors. In fact, experiments have shown
that First-Fit can produce near-optimal colorings for many classes of graphs [4].
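As an illustration of the sequential procedure and the d + 1 bound, here is a minimal Python sketch of First-Fit. The function name `first_fit`, the dictionary-based adjacency format and the wheel-graph example are assumptions made for this sketch.

```python
def first_fit(adj):
    """Greedy (First-Fit) coloring.

    adj maps each vertex to an iterable of its neighbors; vertices are
    visited in the dictionary's iteration order. Colors start at 1.
    """
    color = {}
    for v in adj:
        # Colors already taken by colored neighbors of v.
        used = {color[u] for u in adj[v] if u in color}
        c = 1
        while c in used:  # smallest color not used by a neighbor
            c += 1
        color[v] = c
    return color


# Hypothetical example: a 5-cycle (vertices 0..4) plus a hub (vertex 5)
# connected to every cycle vertex.
wheel = {
    0: [1, 4, 5], 1: [0, 2, 5], 2: [1, 3, 5],
    3: [2, 4, 5], 4: [3, 0, 5], 5: [0, 1, 2, 3, 4],
}
coloring = first_fit(wheel)
max_degree = max(len(neighbors) for neighbors in wheel.values())
```

For this graph the hub has degree d = 5, so the bound allows up to 6 colors; the sketch uses 4, which is also the minimum for a wheel built on an odd cycle.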

2.2 Optimistic Coloring 

Gebremedhin and Manne introduced an optimistic approach to parallelizing the 
greedy graph coloring algorithm [9]. They described a fast and scalable version 
for shared-memory systems based on the principles of speculative (or optimistic) 
execution. The idea is that we can color all vertices in parallel using First-Fit 
without caring about race conditions at first (stage 1); this can lead to defective 
coloring, i.e. two adjacent vertices might get the same color. Defects can then 
be spotted in parallel (stage 2) and fixed by a single thread (stage 3). 

Picking up where Gebremedhin and Manne left off, Çatalyürek et al. improved
the original algorithm by removing the sequential conflict-resolution stage
and applying the first two parallel stages iteratively. This work was presented
in [1]. Each of the two phases, called tentative coloring phase and conflict
detection phase respectively, is executed in parallel over a relevant set of
vertices. Like the original algorithm by Gebremedhin and Manne, the tentative
coloring phase produces a pseudo-coloring of the graph, whereas in the conflict
detection phase threads identify defectively colored vertices and append them
to a list L. Instead of resolving conflicts in L serially, L now forms the new
set of vertices over which the next execution of the tentative coloring phase
will iterate. This process is repeated until no conflicts are encountered.


Algorithm 2 The parallel graph coloring algorithm by Çatalyürek et al.
Input: G(V, E)
U ← V
while U ≠ ∅ do
    #pragma omp parallel for      ▷ Phase 1 - Tentative coloring (in parallel)
    for all vertices V_i ∈ U do   ▷ execute First-Fit
        C ← {colors of all colored vertices V_j ∈ adj(V_i)}
        c(V_i) ← {smallest color ∉ C}
    #pragma omp barrier
    L ← ∅                         ▷ global list of defectively colored vertices
    #pragma omp parallel for      ▷ Phase 2 - Conflict detection (in parallel)
    for all vertices V_i ∈ U do
        if ∃ V_j ∈ adj(V_i), V_j > V_i : c(V_j) == c(V_i) then
            L ← L ∪ V_i           ▷ mark V_i as defectively colored
    #pragma omp barrier
    U ← L                         ▷ Vertices to be re-colored in the next round


Algorithm 2 summarizes this coloring method. As can be seen, there is no 
sequential part in the whole process. Additionally, speed does not come at the 
expense of coloring quality. The authors have demonstrated that this algorithm 
produces colorings using about the same number of colors as the serial greedy 
algorithm. However, there is still a source of sequentiality, namely the two thread 
synchronization points in every iteration of the while-loop. Synchronization can 
easily become a scalability barrier for high numbers of threads and should be 
minimized or eliminated if possible. 
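The round structure of Algorithm 2 can be sketched in Python as a sequential simulation. This is an illustrative sketch under stated assumptions, not the authors' OpenMP code: the names `smallest_free`, `iterative_speculative` and the `width` parameter are hypothetical, and concurrency is mimicked by letting each batch of `width` vertices pick colors from the same snapshot before committing.

```python
def smallest_free(v, adj, color):
    """Smallest color (starting at 1) not used by any colored neighbor of v."""
    used = {color[u] for u in adj[v] if color[u] is not None}
    c = 1
    while c in used:
        c += 1
    return c


def iterative_speculative(adj, width):
    """Simulate the two-phase scheme; adj is a list of neighbor lists.

    Returns (colors, number of rounds, total number of flagged vertices).
    """
    color = [None] * len(adj)
    work = list(range(len(adj)))
    rounds = conflicts = 0
    while work:
        rounds += 1
        # Phase 1: tentative coloring. All vertices in a batch read the same
        # snapshot and then commit, which is how races are modelled here.
        for start in range(0, len(work), width):
            batch = work[start:start + width]
            picks = [smallest_free(v, adj, color) for v in batch]
            for v, c in zip(batch, picks):
                color[v] = c
        # (first barrier)
        # Phase 2: conflict detection. Only the lower-numbered endpoint of a
        # conflicting edge is flagged, as in the V_j > V_i test.
        defective = [v for v in work
                     if any(u > v and color[u] == color[v] for u in adj[v])]
        conflicts += len(defective)
        # (second barrier)
        work = defective
    return color, rounds, conflicts


# Hypothetical example: the complete graph on 4 vertices, 4 simulated threads.
k4 = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]]
result = iterative_speculative(k4, width=4)  # -> ([4, 3, 2, 1], 4, 6)
```

With `width=1` the simulation degenerates to sequential First-Fit and finishes in a single round without conflicts. The comments marking the two barriers show where the synchronization cost discussed above would be paid in a threaded implementation.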

3 Implementation 

Moving toward the direction of removing as much thread synchronization as
possible, we improved the algorithm by Çatalyürek et al. by eliminating one of
the two barriers inside the while-loop. This was achieved by merging the two
parallel for-loops into a single parallel for-loop. We observed that when a vertex
is found to be defective it can be re-colored immediately instead of deferring
its re-coloring for the next round. Therefore, the tentative-coloring and
conflict-detection phases can be combined into a single detect-and-recolor
phase in which we inspect all vertices which were re-colored in the previous
iteration of the while-loop. Doing so leaves only one thread synchronization
point per round, as can be seen in Algorithm 3. This barrier guarantees that
any changes committed by a thread are made visible system-wide before
proceeding to the next round.






Algorithm 3 The improved parallel graph coloring technique.
Input: G(V, E)
#pragma omp parallel for          ▷ perform tentative coloring on G; round 0
for all vertices V_i ∈ G do
    C ← {colors of all colored vertices V_j ∈ adj(V_i)}
    c(V_i) ← {smallest color ∉ C}
#pragma omp barrier
U^0 ← V                           ▷ mark all vertices for inspection
i ← 1                             ▷ round counter
while U^(i-1) ≠ ∅ do              ▷ ∃ vertices (re-)colored in the last round
    L ← ∅                         ▷ global list of defectively colored vertices
    #pragma omp parallel for
    for all vertices V_i ∈ U^(i-1) do
        if ∃ V_j ∈ adj(V_i), V_j > V_i : c(V_j) == c(V_i) then   ▷ if they are (still) defective
            C ← {colors of all colored V_j ∈ adj(V_i)}           ▷ re-color them
            c(V_i) ← {smallest color ∉ C}
            L ← L ∪ V_i           ▷ V_i was re-colored in this round
    #pragma omp barrier
    U^i ← L                       ▷ Vertices to be inspected in the next round
    i ← i + 1                     ▷ proceed to the next round
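The merged detect-and-recolor round can be sketched with a sequential Python simulation. As before, this is a sketch under stated assumptions rather than the actual implementation: `smallest_free`, `rsoc` and `width` are hypothetical names, and each batch of `width` vertices is treated as a group of threads that read the same snapshot and then commit. A deterministic simulation of this kind shows the control flow only; it does not reproduce timing effects on real hardware.

```python
def smallest_free(v, adj, color):
    """Smallest color (starting at 1) not used by any colored neighbor of v."""
    used = {color[u] for u in adj[v] if color[u] is not None}
    c = 1
    while c in used:
        c += 1
    return c


def rsoc(adj, width):
    """Simulate the single-barrier scheme; adj is a list of neighbor lists.

    Returns (colors, number of rounds, total number of re-colorings).
    """
    n = len(adj)
    color = [None] * n
    # Round 0: tentative coloring of every vertex.
    for start in range(0, n, width):
        batch = range(start, min(start + width, n))
        picks = [smallest_free(v, adj, color) for v in batch]
        for v, c in zip(batch, picks):
            color[v] = c
    # (barrier)
    inspect = list(range(n))  # U^0: all vertices
    rounds = recolorings = 0
    while inspect:
        rounds += 1
        recolored = []
        for start in range(0, len(inspect), width):
            batch = inspect[start:start + width]
            # Detect and re-color in one step: a vertex that still clashes
            # with a higher-numbered neighbor picks a new color immediately.
            picks = [(v, smallest_free(v, adj, color)) for v in batch
                     if any(u > v and color[u] == color[v] for u in adj[v])]
            for v, c in picks:
                color[v] = c
                recolored.append(v)
        # (the only barrier inside the loop)
        recolorings += len(recolored)
        inspect = recolored  # U^i: vertices re-colored in this round
    return color, rounds, recolorings


# Hypothetical example: the complete graph on 4 vertices, 4 simulated threads.
k4 = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]]
result = rsoc(k4, width=4)
```

Note that the last round inspects the most recently re-colored vertices and finds nothing to fix, which is what terminates the loop.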


4 Experimental Results 

In order to evaluate our improved coloring method, henceforth referred to as
Reduced Synchronization Optimistic Coloring (RSOC), and compare it to the
previous state-of-the-art technique by Çatalyürek et al., we ran a series of
benchmarks using 2D and 3D meshes of triangular and tetrahedral elements
respectively (commonly used in finite element and finite volume methods),
alongside randomly generated graphs using the R-MAT graph generation algorithm
[2]. Simplicial 2D/3D meshes are used in order to measure performance and
scalability for our target application area [10], whereas RMAT graphs were used
for consistency with the experimental methodology used in Çatalyürek et al.'s
publication; the authors state that those RMAT graphs "are designed to represent
instances posing varying levels of difficulty for the performance of multithreaded
coloring algorithms" [1].

For the 2D case we have used a 2D anisotropic mesh (adapted to the
requirements of some CFD problem) named mesh2d, which consists of ≈ 250k
vertices. We also evaluate performance using two 3D meshes, taken from the
University of Florida Sparse Matrix Collection [5]. bmw3_2 is a mesh modelling
a BMW Series 3 car consisting of ≈ 227k vertices, whereas pwtk represents a
pressurized wind tunnel and consists of ≈ 218k vertices. Finally, we generated
three 16M-vertex, 128M-edge RMAT graphs, namely RMAT-ER (Erdős-Rényi), RMAT-G
(Good) and RMAT-B (Bad), randomly shuffling vertex indices so as to reduce the
benefits of data locality and large caches. For more information on those
graphs the reader is referred to the original publication by Çatalyürek et al.
[1].





The experiments were run on two systems: a dual-socket Intel® Xeon® E5-2650
system (Sandy Bridge, 2.00GHz, 8 physical cores per socket, 2-way
hyper-threading) running Red Hat® Enterprise Linux® Server release 6.4
(Santiago) and an Intel® Xeon Phi™ 5110P board (1.053GHz, 60 physical cores,
4-way hyper-threading). Both versions of the code (intel64 and mic) were
compiled with Intel® Composer XE 2013 SP1 and with the compiler flags
-O3 -xAVX. The benchmarks were run using Intel®'s thread-core affinity support.

Table 1 shows the average execution time over 10 runs of both algorithms on
the 2 systems, Intel® Xeon® and Intel® Xeon Phi™, using the 3 finite
element/volume meshes and the 3 RMAT graphs. Rows preceded by "C" correspond
to the algorithm by Çatalyürek et al., rows preceded by "R" pertain to the
improved version. Timings for the meshes are given in milliseconds whereas for
the RMAT graphs they are in seconds. As can be seen, RSOC performs faster than
Çatalyürek et al. for every test graph on both platforms, while scaling better as
the number of threads increases, especially on Intel® Xeon Phi™.
Table 1. Execution time of both algorithms on 2 different platforms, Intel® Xeon®
and Intel® Xeon Phi™, with varying number of OpenMP threads and using the 3 finite
element/volume meshes and the 3 RMAT graphs. Rows preceded by "C" correspond
to the algorithm by Çatalyürek et al., rows preceded by "R" pertain to the improved
version. Timings for the meshes are given in milliseconds whereas for the graphs they
are in seconds.


Figures 1 and 2 show the relative speedup of RSOC over Çatalyürek et al.
for all test graphs on Intel® Xeon® and Intel® Xeon Phi™, respectively, i.e.
how much faster our implementation is than its predecessor for a given number
of threads. With the exception of RMAT-ER and RMAT-G, on which there is no
difference in performance, the gap between the two algorithms widens as the
number of threads increases, reaching a maximum value of 50% on Intel® Xeon
Phi™ for RMAT-B.

Fig. 1. Speedup of RSOC relative to Çatalyürek et al. as the number of threads
increases on Intel® Xeon® E5-2650.

Fig. 2. Speedup of RSOC relative to Çatalyürek et al. as the number of threads
increases on Intel® Xeon Phi™ 5110P.

Looking at the total number of coloring conflicts encountered throughout
the execution of both algorithms as well as the number of iterations each
algorithm needs in order to resolve them, we can identify an additional source
of speedup for our algorithm (apart from the absence of one barrier). We will use
the Intel® Xeon Phi™ system for this study, as it is the platform on which the
most interesting results have been observed. Figures 3 and 4 depict the total
number of conflicts for the three meshes and the RMAT graphs, respectively.
When using few threads both algorithms produce about the same number of
conflicts. However, moving to higher levels of parallelism reveals that RSOC
results in far fewer defects in coloring for certain classes of graphs.

This observation can be explained as follows: In Çatalyürek et al. all threads
synchronize before entering the conflict-resolution phase (the next tentative
coloring round), which means that they enter that phase and start resolving
conflicts at the very same time. Therefore, it is highly possible that two
adjacent vertices with conflicting colors will be processed by two threads
simultaneously, which leads once again to new defects. In our improved
algorithm, on the other hand, a conflict is resolved as soon as it is
discovered by a thread. The likelihood that another thread is recoloring a
neighboring vertex at the same time is lower than in Çatalyürek et al.
Fig. 3. Number of conflicts on Intel® Xeon Phi™ 5110P using mesh2d, bmw3_2 and pwtk.


The reduced number of conflicts also results in fewer iterations of the
algorithm, as can be seen in Figures 5 and 6. Combined with the absence of one
barrier from the while-loop, this explains why our new algorithm ultimately
outperforms its predecessor. A nice property is that both algorithms
produce colorings using the same number of colors, i.e. quality of coloring is not
compromised by the higher execution speed.
Fig. 4. Number of conflicts on Intel® Xeon Phi™ 5110P using RMAT-ER, RMAT-G and
RMAT-B.

Fig. 5. Number of iterations on Intel® Xeon Phi™ 5110P using mesh2d, bmw3_2 and
pwtk.

Fig. 6. Number of iterations on Intel® Xeon Phi™ 5110P using RMAT-ER, RMAT-G and
RMAT-B.

5 SIMT restrictions 

Trying to run the optimistic coloring algorithms using CUDA on an Nvidia GPU
revealed a potential weakness. Neither algorithm terminated; instead, threads
spun forever in an infinite loop. This is due to the nature of SIMT-style
multithreading, in which the lockstep warp execution results in ties never
being broken. An example of why these algorithms result in infinite loops in
SIMT-style parallelism can be seen in Figure 7, where we have a simple
two-vertex graph and two threads, each processing one vertex (this scenario is
likely to actually occur at a later iteration of the while-loop, where the
global list of defects L is left with a few pairs of adjacent vertices). At the
beginning (a), both vertices are uncolored. Each thread decides that the
smallest color available for its own vertex is red. Both threads commit their
decision at the same clock cycle, which results in the defective coloring shown
in (b). In the next round the threads try to resolve the conflict and decide
that the new smallest color available is green. The decision is committed at
the same clock cycle, resulting once again in defects (c) and the process goes
on forever.

Theoretically, this scenario is possible for CPUs as well, although the
probability is extremely low. We believe that there will always be some
randomness (i.e. lack of thread coordination) on CPUs which guarantees
convergence of the optimistic algorithms. This randomness can also be
"emulated" on GPUs by having a dynamic assignment of vertices to threads and
making sure that two adjacent vertices are always processed by threads of
different warps.
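The two-vertex scenario of Figure 7 can be reproduced with a small Python simulation. This sketch assumes, as the description of the figure suggests, that both threads treat their own vertex as defective whenever it shares a color with its neighbor. The names `smallest_free`, `needs_color` and `simulate` are hypothetical, colors are numbered 1, 2, ... instead of red, green, ..., and lockstep execution is modelled by letting all threads read before any of them commits.

```python
def smallest_free(v, adj, color):
    """Smallest color (starting at 1) not used by any colored neighbor of v."""
    used = {color[u] for u in adj[v] if color[u] is not None}
    c = 1
    while c in used:
        c += 1
    return c


def needs_color(v, adj, color):
    """A vertex needs work if it is uncolored or clashes with any neighbor."""
    return color[v] is None or any(color[u] == color[v] for u in adj[v])


def simulate(adj, lockstep, max_rounds=6):
    """Return the coloring after each round, stopping early on convergence."""
    color = [None] * len(adj)
    history = []
    for _ in range(max_rounds):
        todo = [v for v in range(len(adj)) if needs_color(v, adj, color)]
        if not todo:
            break
        if lockstep:
            # All threads read the same state, then all commit together.
            picks = [smallest_free(v, adj, color) for v in todo]
            for v, c in zip(todo, picks):
                color[v] = c
        else:
            # Threads are not coordinated: each commit is visible to the next.
            for v in todo:
                color[v] = smallest_free(v, adj, color)
        history.append(list(color))
    return history


edge = [[1], [0]]  # the two-vertex graph of Figure 7
stuck = simulate(edge, lockstep=True)
converged = simulate(edge, lockstep=False)
```

In lockstep mode the two vertices keep choosing the same color, alternating between 1 and 2 until `max_rounds` is exhausted. Once the commits are staggered, the second vertex sees the first vertex's choice and the coloring settles after one round.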














































Fig. 7. Example of an infinite loop in SIMT-style parallelism when using one of the
optimistic coloring algorithms. Panels: (a) Graph, (b) Round 1, (c) Round 2,
(d) Round 3.


6 Conclusions 

In this article we presented an older parallel graph coloring algorithm and showed 
how we devised an improved version which outperforms its predecessor, being 
up to 50% faster for certain classes of graphs and scaling better on manycore 
architectures. The difference becomes more pronounced as we move to graphs 
with higher-degree vertices (3D meshes, RMAT-B graph). 

This observation also implies that our method (with the appropriate
extensions) could be a far better option for d-distance colorings of a graph G,
where G^d is considerably more densely connected than G (graph G^d, the d-th
power graph of G, has the same vertex set as G and two vertices in G^d are
connected by an edge if and only if the same vertices are within distance d in
G).
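The definition of the power graph can be illustrated with a short Python sketch that builds G^d by a depth-limited breadth-first search from every vertex. The function name `power_graph` and the path-graph example are assumptions made for this illustration; the paper does not prescribe how G^d should be constructed.

```python
from collections import deque


def power_graph(adj, d):
    """Adjacency sets of G^d: u and v are adjacent iff 1 <= dist_G(u, v) <= d.

    adj is a list of neighbor lists for vertices 0..n-1.
    """
    result = []
    for source in range(len(adj)):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            if dist[v] == d:  # do not expand beyond distance d
                continue
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        result.append(set(dist) - {source})
    return result


# Hypothetical example: a path on 5 vertices and its square.
path = [[1], [0, 2], [1, 3], [2, 4], [3]]
squared = power_graph(path, 2)
```

On this path the maximum degree grows from 2 in G to 4 in G^2, a small instance of the densification mentioned above.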

Speed and scalability stem from two sources, (a) reduced number of conflicts 
which also results in fewer iterations and (b) reduced thread synchronization 
per iteration. Coloring quality remains at the same levels as in older parallel 
algorithms, which in turn are very close to the serial greedy algorithm, meaning 
that they produce near-optimal colorings for most classes of graphs. 